Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study
Abstract
The paper presents an illustration and empirical study of releasing multiply imputed, fully synthetic public use microdata. Simulations based on data from the US Current Population Survey are used to evaluate the potential validity of inferences based on fully synthetic data for a variety of descriptive and analytic estimands, to assess the degree of protection of confidentiality that is afforded by fully synthetic data and to illustrate the specification of synthetic data imputation models. Benefits and limitations of releasing fully synthetic data sets are discussed.
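The abstract does not say how an analyst would combine results across the released synthetic data sets. As a minimal sketch, the block below assumes the combining rules usually attributed to Raghunathan, Reiter, and Rubin for fully synthetic data: the point estimate is the average of the m per-data-set estimates, and the variance estimate is T_s = (1 + 1/m) b_m - v_bar, where b_m is the between-data-set variance and v_bar is the average within-data-set variance. The function name and the input lists are hypothetical and serve only as illustration; they are not taken from the paper.

```python
from statistics import mean, variance


def combine_fully_synthetic(estimates, variances):
    """Combine m point estimates and their variance estimates.

    Assumes the fully synthetic combining rule
    T_s = (1 + 1/m) * b_m - v_bar; this rule is an assumption of the
    sketch and is not stated in the abstract.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates with matching variances")

    q_bar = mean(estimates)    # combined point estimate
    b_m = variance(estimates)  # between-data-set variance (divisor m - 1)
    v_bar = mean(variances)    # average within-data-set variance

    # T_s can come out negative for small m; it is returned unmodified here.
    t_s = (1 + 1 / m) * b_m - v_bar
    return q_bar, b_m, v_bar, t_s


# Hypothetical estimates of one quantity from m = 3 synthetic data sets.
q_bar, b_m, v_bar, t_s = combine_fully_synthetic(
    [10.0, 12.0, 11.0], [0.5, 0.5, 0.5]
)
print(q_bar, t_s)
```

Because v_bar is subtracted, the variance estimate can be negative when m is small, so an analyst would need to check the sign of T_s before forming intervals.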
Similar resources
Significance tests for multi-component estimands from multiply imputed, synthetic microdata
To limit the risks of disclosures when releasing data to the public, it has been suggested that statistical agencies release multiply imputed, synthetic microdata. For example, the released microdata can be fully synthetic, comprising random samples of units from the sampling frame with simulated values of variables. Or, the released microdata can be partially synthetic, comprising the units or...
Sampling with Synthesis: A New Approach for Releasing Public Use Census Microdata
Many statistical agencies disseminate samples of census microdata, i.e., data on individual records, to the public. Before releasing the microdata, agencies typically alter identifying or sensitive values to protect data subjects’ confidentiality, for example by coarsening, perturbing, or swapping data. These standard disclosure limitation techniques distort relationships and distributional fea...
Combining Methods to Create Synthetic Microdata: Quantile Regression, Hot Deck, and Rank Swapping
Government agencies must simultaneously disseminate useful microdata and maintain confidentiality of individual records. Releasing synthetic data is one approach. We propose to create synthetic data using a combination of quantile regression, hot deck imputation, and rank swapping. The result is a releasable data set containing original values for a few key variables, synthetic quantile regress...
Distribution-Preserving Statistical Disclosure Limitation
One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences based on the partially synthetic data, because the imputation model determines the distribution of s...
Distribution-preserving statistical disclosure limitation
One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences because the distribution of synthetic data is completely determined by the model used to generate th...